Amendments to the Claims:

This listing of claims will replace all prior versions, and listings, of claims in the
application:

Listing of Claims:

1. (currently amended) A method for crawling documents comprising:
receiving a uniform resource locator (URL); and 

receiving at least two different copies of a document associated with the 
URL; and

determining whether the URL is associated with a web site corresponding
to the URL that uses session identifiers based, at least in part, on a comparison
of a portion of URLs that are within the document and that change between the
at least two different copies of [[a]] the document downloaded from the web site.

2. (original) The method of claim 1, wherein the document is a home 
page of the web site. 

3. (currently amended) The method of claim 1, further comprising:
extracting, when the URL [[is associated with]] corresponds to a web site that
uses session identifiers, a session identifier from the URL to obtain a clean URL;
and 

determining whether the URL has already been crawled based, at least in
part, on a comparison of the clean URL to a set of clean URLs that represent
previously crawled URLs. 
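
As an illustration of the clean-URL comparison recited in claim 3, a minimal Python sketch follows. It assumes, purely for illustration, that the session identifier is carried in a query parameter whose name (the hypothetical `sid`) is already known; the claims do not specify where the identifier is embedded. The function names `clean_url` and `already_crawled` are likewise hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def clean_url(url, session_params=("sid",)):
    """Return the URL with the assumed session-identifier parameters removed."""
    parts = urlsplit(url)
    # Keep every query parameter except the ones assumed to hold a session id.
    kept = [(name, value)
            for name, value in parse_qsl(parts.query, keep_blank_values=True)
            if name not in session_params]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))

def already_crawled(url, crawled_clean_urls, session_params=("sid",)):
    """Check the clean form of the URL against a set of previously crawled clean URLs."""
    return clean_url(url, session_params) in crawled_clean_urls
```

For example, under these assumptions two URLs that differ only in their `sid` value reduce to the same clean URL, so the second would be treated as already crawled.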

4. (currently amended) The method of claim 3, wherein the portion of
the compared URLs that change [[are identified using]] include URLs that are local
to the web site. 

5. (original) The method of claim 3, wherein the session identifiers 
from the URLs are extracted using rules for the web site. 

6. (original) The method of claim 5, wherein the rules are determined 
automatically. 

7. (original) The method of claim 3, further comprising: 
receiving the URL as a URL from a previously crawled web document. 

8. (original) The method of claim 3, further comprising: 

crawling the URL when the URL is determined to not already have been 
crawled. 

9. (currently amended) The method of claim 1, wherein the
comparison determines that the web site uses session identifiers when [[the]] a 
portion of the URLs that change between the at least two different copies of the 
document is greater than a predetermined value. 

10. (original) A method for identifying web sites that use session 
identifiers comprising: 
downloading at least two different copies of at least one document from a
web site; 

extracting uniform resource locators (URLs) from the two different copies 
of the web document; 

comparing the extracted URLs of the two different copies of the document;
and

determining whether the web site uses session identifiers based on the 
comparison. 

11. (original) The method of claim 10, wherein the determining whether
the web site uses session identifiers further includes: 

determining that the web site uses session identifiers when the 
comparison indicates that at least a predetermined portion of the URLs change 
between the two different copies. 
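
The determination recited in claims 10 and 11 can be sketched in Python as follows. This is only an illustrative reading, not the claimed implementation: it assumes the two copies are already available as HTML strings, that URLs are taken from `href` attributes of anchor tags, and that the "predetermined portion" is a fraction (the hypothetical default `threshold=0.5`) of the URLs in the first copy that do not reappear in the second. All names are hypothetical.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.urls.append(value)

def extract_urls(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.urls

def uses_session_ids(copy_a, copy_b, threshold=0.5):
    """Return True when at least `threshold` of the URLs in copy_a changed in copy_b."""
    urls_a = set(extract_urls(copy_a))
    urls_b = set(extract_urls(copy_b))
    if not urls_a:
        return False  # nothing to compare
    changed = urls_a - urls_b  # URLs that did not survive unchanged
    return len(changed) / len(urls_a) >= threshold
```

Under these assumptions, two downloads of the same page whose links differ only in an embedded identifier would yield a changed fraction near 1, while a page with stable links would yield 0.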

12. (original) The method of claim 10, wherein extracting URLs from 
the two different copies of the document includes extracting only URLs that are 
local to the web site. 
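
One possible reading of "local to the web site" in claim 12 is sketched below in Python. The claims do not define locality; this sketch assumes, as an illustration only, that a URL is local when it resolves to the same host as the page it was extracted from. The function name `local_urls` is hypothetical.

```python
from urllib.parse import urljoin, urlsplit

def local_urls(urls, page_url):
    """Return the absolute form of each URL that shares the page's host."""
    site_host = urlsplit(page_url).netloc.lower()
    result = []
    for url in urls:
        absolute = urljoin(page_url, url)  # resolve relative links
        if urlsplit(absolute).netloc.lower() == site_host:
            result.append(absolute)
    return result
```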

13. (original) The method of claim 10, wherein the document is a home 
page of the web site. 

14. (original) The method of claim 10, further comprising: 
analyzing the extracted URLs, when the web site is determined to use
session identifiers, to generate at least one rule identifying how the session 
identifiers are embedded in the URLs. 
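
A rule of the kind recited in claim 14 might, for example, name the query parameter that carries the session identifier. The Python sketch below assumes that identifiers appear as query-parameter values and that URLs from the two copies can be paired by host and path; neither assumption comes from the claims, and the function name `infer_session_params` is hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl

def infer_session_params(urls_a, urls_b):
    """Return names of query parameters whose values differ between the two copies."""
    def index(urls):
        # Pair URLs by (host, path). URLs sharing a host and path overwrite
        # one another here, which is a simplification of this sketch.
        table = {}
        for url in urls:
            parts = urlsplit(url)
            table[(parts.netloc, parts.path)] = dict(parse_qsl(parts.query))
        return table

    table_a, table_b = index(urls_a), index(urls_b)
    varying = set()
    for key, params_a in table_a.items():
        params_b = table_b.get(key)
        if params_b is None:
            continue  # no counterpart in the second copy
        for name, value in params_a.items():
            if name in params_b and params_b[name] != value:
                varying.add(name)
    return sorted(varying)
```

The returned parameter names could then serve as the rule used to strip session identifiers from other URLs of the same web site.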

15. (original) A device comprising: 

a spider component configured to crawl web documents associated with at 
least one web site; and 

a session identifier component configured to determine whether the web 
site uses session identifiers based on a comparison of a portion of uniform 
resource locators (URLs) that change between different copies of at least one 
web document downloaded from the web site. 

16. (original) The device of claim 15, wherein the spider component 
further comprises: 

at least one fetch component configured to download content from a 
network; and 

a content manager configured to extract URLs from the downloaded 
content. 

17. (original) The device of claim 16, wherein the spider component
further comprises: 

a URL manager configured to store the extracted URLs. 

18. (original) The device of claim 15, wherein the at least one web
document is a home page of the web site. 

19. (original) The device of claim 15, wherein the portion of the URLs 
that change are identified from URLs that are local to the web site. 

20. (original) The device of claim 15, further comprising: 

a session rule generator configured to generate rules describing how the 
web site embeds session identifiers in the at least one web document. 

21. (original) A device comprising: 

means for downloading at least two different copies of at least one web 
document from a web site; 

means for extracting uniform resource locators (URLs) from the two 
different copies of the web document; 

means for comparing the extracted URLs of the two different copies of the 
web document; and 

means for determining whether the web site uses session identifiers 
based on the comparison. 

22. (original) The device of claim 21, wherein the means for
determining determines that the web site uses session identifiers when the 
comparison indicates that at least a predetermined portion of the URLs change 
between the two different copies. 

23. (original) The device of claim 21, wherein the means for extracting
URLs from the two different copies of the web document includes means for 
extracting only URLs that are local to the web site. 

24. (original) The device of claim 21, wherein the web document is a 
home page of the web site. 

25. (original) The device of claim 21, further comprising:

means for analyzing the extracted URLs, when the web site is determined 
to use session identifiers, to generate rules describing how the session identifiers 
are embedded in the URLs. 

26. (original) A computer-readable medium containing programming 
instructions that when executed by at least one processor cause the processor to 
perform a method for identifying web sites that use session identifiers including: 

downloading at least two different copies of at least one document from a 
web site; 

extracting uniform resource locators (URLs) from the two different copies 
of the document; 

comparing the extracted URLs of the two different copies of the web 
document; and 

determining whether the web site uses session identifiers based on the 
comparison. 

27. (original) The computer-readable medium of claim 26, wherein the
determining when the web site uses session identifiers further includes: 

determining that the web site uses session identifiers when the 
comparison indicates that at least a predetermined portion of the URLs change 
between the two different copies. 

28. (original) The computer-readable medium of claim 26, wherein 
extracting URLs from the two different copies of the web document includes 
extracting only URLs that are local to the web site. 

29. (original) The computer-readable medium of claim 26, wherein the 
web document is a home page of the web site. 

30. (original) The computer-readable medium of claim 26, further 
comprising instructions that cause the at least one processor to: 

analyze the extracted URLs, when the web site is determined to use 
session identifiers, to generate at least one rule describing how the session 
identifiers are embedded in the URLs. 

31. (currently amended) A method for crawling documents comprising:
receiving a uniform resource locator (URL); and 

determining whether the URL is associated with a web site that uses 
session identifiers based, at least in part, on a comparison of content between
different duplicate or near-duplicate copies of a document downloaded from the
web site for two different URLs.
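
The content comparison recited in claim 31 could be approximated as in the Python sketch below. The claim does not say how duplicate or near-duplicate copies are recognized; this sketch substitutes the standard-library `difflib.SequenceMatcher` similarity ratio with a hypothetical cutoff (`min_ratio=0.9`) as a stand-in. The function names are hypothetical.

```python
from difflib import SequenceMatcher

def near_duplicate(content_a, content_b, min_ratio=0.9):
    """Treat two documents as near-duplicates when their similarity ratio is high."""
    return SequenceMatcher(None, content_a, content_b).ratio() >= min_ratio

def suggests_session_ids(url_a, content_a, url_b, content_b, min_ratio=0.9):
    """Two different URLs returning (near-)identical content hint at session identifiers."""
    return url_a != url_b and near_duplicate(content_a, content_b, min_ratio)
```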